AITopics | old policy

Learning to Constrain Policy Optimization with Virtual Trust Region

Neural Information Processing Systems | Feb-9-2026, 00:07:39 GMT

Compared to Deep Q-learning, deep policy gradient (PG) methods are often more flexible and applicable to discrete and continuous action problems. However, these methods tend to suffer from high sample complexity and training instability since the gradient may not accurately reflect the policy gain when the policy changes substantially [6].

artificial intelligence, machine learning, virtual policy, (16 more...)

Neural Information Processing Systems

Country: Oceania > Australia (0.14)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Robots (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.46)


Learning to Constrain Policy Optimization with Virtual Trust Region

Neural Information Processing Systems | Aug-14-2025, 21:12:50 GMT

We introduce a constrained optimization method for policy gradient reinforcement learning, which uses a virtual trust region to regulate each policy update. In addition to using the proximity of one single old policy as the normal trust region, we propose forming a second trust region through another virtual policy representing a wide range of past policies. We then enforce the new policy to stay closer to the virtual policy, which is beneficial if the old policy performs poorly. More importantly, we propose a mechanism to automatically build the virtual policy from a memory of past policies, providing a new capability for dynamically learning appropriate virtual trust regions during the optimization process. Our proposed method, dubbed Memory-Constrained Policy Optimization (MCPO), is examined in diverse environments, including robotic locomotion control, navigation with sparse rewards and Atari games, consistently demonstrating competitive performance against recent on-policy constrained policy gradient methods.
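The two trust regions described in this abstract can be illustrated with a small sketch. It assumes discrete action distributions, a virtual policy formed as a fixed weighted mixture of stored past policies, and KL penalties with hypothetical coefficients `beta_old` and `beta_virtual`. None of these choices come from the abstract, which says the virtual policy is built automatically from memory but does not state how.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; q must be positive where p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def virtual_policy(past_policies, weights):
    """Assumed form: a weighted mixture of past action distributions.

    `weights` should be non-negative and sum to 1.
    """
    n_actions = len(past_policies[0])
    return [sum(w * p[a] for w, p in zip(weights, past_policies))
            for a in range(n_actions)]

def two_region_penalty(new, old, virtual, beta_old=1.0, beta_virtual=1.0):
    """Penalty keeping `new` close to both the old and the virtual policy."""
    return beta_old * kl(old, new) + beta_virtual * kl(virtual, new)

# Hypothetical memory of two past policies over two actions.
memory = [[0.8, 0.2], [0.2, 0.8]]
virtual = virtual_policy(memory, [0.5, 0.5])
penalty = two_region_penalty(new=[0.6, 0.4], old=[0.9, 0.1], virtual=virtual)
```

In this sketch the penalty would be subtracted from a policy gradient surrogate objective; raising `beta_virtual` relative to `beta_old` pulls the update toward the mixture of past policies rather than the most recent one.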

mcpo, policy optimization, virtual policy, (14 more...)

Neural Information Processing Systems

Country: Oceania > Australia (0.14)

Genre: Research Report > New Finding (0.46)

Industry: Leisure & Entertainment > Games > Computer Games (0.55)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.90)
Information Technology > Artificial Intelligence > Robots (0.88)


On the Theory and Practice of GRPO: A Trajectory-Corrected Approach with Fast Convergence

Pang, Lei, Jin, Ruinan

arXiv.org Artificial Intelligence | Aug-8-2025

Group Relative Policy Optimization (GRPO), recently proposed by DeepSeek, is a critic-free reinforcement learning algorithm for fine-tuning large language models. It replaces the value function in Proximal Policy Optimization (PPO) with group-normalized rewards, while retaining PPO-style token-level importance sampling based on an old policy. We show that the GRPO update rule in fact estimates the policy gradient at the old policy rather than the current one. However, since the old policy is refreshed every few steps, the discrepancy between the two remains small, limiting the impact of this bias in practice. We validate this through an ablation study in which importance sampling is entirely removed, and updates are instead performed using the gradient estimated at a fixed old policy across multiple optimization steps. Remarkably, this simplification results in performance comparable to standard GRPO. Motivated by these findings, we propose a new algorithm: Trajectory level Importance Corrected GRPO (TIC GRPO). TIC GRPO replaces token-level importance ratios with a single trajectory-level probability ratio, yielding an unbiased estimate of the current policy gradient while preserving the critic-free structure. Furthermore, we present the first theoretical convergence analysis for GRPO-style methods, covering both the original GRPO and our proposed variant.
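The two ingredients named in this abstract, group-normalized rewards and the choice between token-level and trajectory-level importance ratios, can be sketched as follows. This is a minimal illustration under assumptions: the population standard deviation and the `eps` constant are choices made here, and the function names are hypothetical rather than taken from the paper.

```python
import math
from statistics import mean, pstdev

def group_normalized_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

def token_level_ratios(logp_new, logp_old):
    """One importance ratio per token, as in PPO-style updates."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def trajectory_level_ratio(logp_new, logp_old):
    """A single ratio for the whole response: the product of token ratios."""
    return math.exp(sum(logp_new) - sum(logp_old))

# Hypothetical group of four responses with binary rewards.
advantages = group_normalized_advantages([1.0, 0.0, 0.0, 1.0])

# Hypothetical per-token log-probabilities for one two-token response.
logp_new, logp_old = [-1.0, -2.0], [-1.5, -1.5]
per_token = token_level_ratios(logp_new, logp_old)
per_trajectory = trajectory_level_ratio(logp_new, logp_old)
```

The trajectory-level ratio equals the product of the token-level ratios, so the difference between the two schemes is whether each token's gradient term is weighted by its own ratio or by the ratio of the full sequence.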

grpo, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2508.02833

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)


LAPD allowed to use drones as 'first responders' under new program

Los Angeles Times | Jun-24-2025, 23:20:26 GMT

Citing successes other police departments across the country have seen using drones, the Los Angeles Police Commission said it would allow the LAPD to deploy unmanned aircraft on routine emergency calls. The civilian oversight body approved an updated policy Tuesday allowing drones to be used in more situations, including "calls for service." The new guidelines listed other scenarios for future drone use -- "high-risk incident, investigative purpose, large-scale event, natural disaster" -- and transferred their command from the Air Support Division to the Office of Special Operations. Previously, the department's nine drones were restricted to a narrow set of dangerous situations, most involving barricaded suspects or explosives. Bryan Lium told commissioners the technology offers responding officers and their supervisors crucial, real-time information about what type of threats they might encounter while responding to an emergency.

artificial intelligence, drone, new policy, (14 more...)

Los Angeles Times

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.57)
North America > United States > California > Los Angeles County > Beverly Hills (0.07)
North America > United States > California > Los Angeles County > Culver City (0.05)

Industry: Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)

Technology: Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles > Drones (1.00)


Reinforcement Learning with Verifiable Rewards: GRPO's Effective Loss, Dynamics, and Success Amplification

Mroueh, Youssef

arXiv.org Machine Learning | Mar-14-2025

Group Relative Policy Optimization (GRPO) was introduced and used successfully to train DeepSeek R1 models for promoting reasoning capabilities of LLMs using verifiable or binary rewards. We show in this paper that GRPO with verifiable rewards can be written as a Kullback-Leibler ($\mathsf{KL}$) regularized contrastive loss, where the contrastive samples are synthetic data sampled from the old policy. The optimal GRPO policy $\pi_{n}$ can be expressed explicitly in terms of the binary reward, as well as the first and second order statistics of the old policy ($\pi_{n-1}$) and the reference policy $\pi_0$. Iterating this scheme, we obtain a sequence of policies $\pi_{n}$ for which we can quantify the probability of success $p_n$. We show that the probability of success of the policy satisfies a recurrence that converges to a fixed point of a function that depends on the initial probability of success $p_0$ and the regularization parameter $\beta$ of the $\mathsf{KL}$ regularizer. We show that the fixed point $p^*$ is guaranteed to be larger than $p_0$, thereby demonstrating that GRPO effectively amplifies the probability of success of the policy.
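As a worked illustration of how a binary reward and the old policy's first and second order statistics combine, consider the group-normalized advantage in the idealized case where the group mean and standard deviation are replaced by their population values. This is a sketch under that assumption, ignoring any smoothing constant in the denominator, and is not the paper's derivation. If a sample from $\pi_{n-1}$ succeeds with probability $p$, the reward $r \in \{0, 1\}$ has mean $p$ and variance $p(1-p)$, so:

$$
A(r) = \frac{r - p}{\sqrt{p(1-p)}}, \qquad
A(1) = \sqrt{\frac{1-p}{p}}, \qquad
A(0) = -\sqrt{\frac{p}{1-p}}.
$$

Under this assumption, rare successes (small $p$) receive a large positive weight, while failures are penalized more heavily as $p$ approaches 1.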

cosh 2, grpo, probability, (13 more...)

arXiv.org Machine Learning

2503.06639

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.52)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)


To Switch or Not to Switch? Balanced Policy Switching in Offline Reinforcement Learning

Ma, Tao, Yang, Xuzhi, Szabo, Zoltan

arXiv.org Machine Learning | Jul-1-2024

Reinforcement learning (RL) -- finding the optimal behaviour (also referred to as policy) maximizing the collected long-term cumulative reward -- is among the most influential approaches in machine learning with a large number of successful applications. In several decision problems, however, one faces the possibility of policy switching -- changing from the current policy to a new one -- which incurs a non-negligible cost (examples include the shifting of the currently applied educational technology, modernization of a computing cluster, and the introduction of a new webpage design), and in the decision one is limited to using historical data, with no possibility of further online interaction. Despite the importance of this offline learning scenario, to the best of our knowledge, very little effort has been made to tackle the key problem of balancing between the gain and the cost of switching in a flexible and principled way. Leveraging ideas from the area of optimal transport, we initiate the systematic study of policy switching in offline RL. We establish fundamental properties and design a Net Actor-Critic algorithm for the proposed novel switching formulation. Numerical experiments demonstrate the efficiency of our approach on multiple benchmarks of the Gymnasium.

net q-function, net value, old policy, (14 more...)

arXiv.org Machine Learning

2407.01837

Country:

Europe > United Kingdom (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report (0.64)
Instructional Material (0.48)

Industry: Education (0.66)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)


Adaptive Proximal Policy Optimization with Upper Confidence Bound

Zhang, Ziqi, Xu, Jingzehua, Zhuang, Zifeng, Liu, Jinxin, Wang, Donglin

arXiv.org Artificial Intelligence | Dec-12-2023

Trust Region Policy Optimization (TRPO) optimizes the policy while constraining each update of the new policy to a trust region, which ensures stability and monotonic improvement. Building on the theoretical guarantees of trust region optimization, Proximal Policy Optimization (PPO) improves sample efficiency and reduces deployment complexity by confining the ratio of the new and old policies to a surrogate trust region. However, this approach is limited by the fixed setting of the surrogate trust region and is not sufficiently adaptive: there is no theoretical proof that the optimal clipping bound remains constant throughout training, or that truncating the ratio of the new and old policies within a fixed surrogate trust region lets the algorithm achieve its best performance. Exploring a dynamic clip bound to improve PPO's performance can therefore be beneficial. To design an adaptive clipped trust region and study the dynamic clip bound's impact on PPO's performance, we introduce an adaptive PPO-CLIP (Adaptive-PPO) method that dynamically explores and exploits the clip bound using a bandit during online training. Experiments provide initial evidence that Adaptive-PPO is more sample efficient and performs better than PPO-CLIP.
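The idea of choosing the clip bound with a bandit can be sketched with the standard PPO clipped surrogate and a UCB1-style selector over a few candidate bounds. This is an illustration under assumptions, not the paper's algorithm: the candidate values, the exploration constant `c`, the class name `UCBClipSelector`, and the use of a scalar feedback signal per training iteration are all hypothetical.

```python
import math

def clipped_surrogate(ratio, advantage, clip_eps):
    """PPO-CLIP objective for one sample: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped * advantage)

class UCBClipSelector:
    """UCB1 bandit whose arms are candidate clip bounds (hypothetical helper)."""

    def __init__(self, candidates, c=1.0):
        self.candidates = list(candidates)
        self.c = c
        self.counts = [0] * len(self.candidates)
        self.values = [0.0] * len(self.candidates)

    def select(self):
        # Try every arm once before applying the confidence bound.
        for i, n in enumerate(self.counts):
            if n == 0:
                return i
        total = sum(self.counts)
        scores = [v + self.c * math.sqrt(2.0 * math.log(total) / n)
                  for v, n in zip(self.values, self.counts)]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, feedback):
        # Incremental mean of the feedback observed for this arm.
        self.counts[arm] += 1
        self.values[arm] += (feedback - self.values[arm]) / self.counts[arm]

selector = UCBClipSelector([0.1, 0.2, 0.3])
for feedback in (0.0, 1.0, 0.0):      # hypothetical per-iteration feedback
    arm = selector.select()
    selector.update(arm, feedback)
best_arm = selector.select()
clip_eps = selector.candidates[best_arm]
```

In a training loop, `select` would be called before each policy update to pick `clip_eps`, and `update` afterwards with some measure of how well the update went, such as the change in average return.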

adaptive-ppo, old policy, trust region, (14 more...)

arXiv.org Artificial Intelligence

2312.07624

Country: Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.64)

Industry: Education > Educational Setting > Online (0.56)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)


Learning to Constrain Policy Optimization with Virtual Trust Region

Le, Hung, George, Thommen Karimpanal, Abdolshah, Majid, Nguyen, Dung, Do, Kien, Gupta, Sunil, Venkatesh, Svetha

arXiv.org Artificial Intelligence | Sep-15-2022

We introduce a constrained optimization method for policy gradient reinforcement learning, which uses a virtual trust region to regulate each policy update. In addition to using the proximity of one single old policy as the normal trust region, we propose forming a second trust region through another virtual policy representing a wide range of past policies. We then enforce the new policy to stay closer to the virtual policy, which is beneficial if the old policy performs poorly. More importantly, we propose a mechanism to automatically build the virtual policy from a memory of past policies, providing a new capability for dynamically learning appropriate virtual trust regions during the optimization process. Our proposed method, dubbed Memory-Constrained Policy Optimization (MCPO), is examined in diverse environments, including robotic locomotion control, navigation with sparse rewards and Atari games, consistently demonstrating competitive performance against recent on-policy constrained policy gradient methods.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

arXiv.org Artificial Intelligence

2204.09315

Country: Oceania > Australia (0.14)

Genre: Research Report > New Finding (0.68)

Industry: Leisure & Entertainment > Games (0.55)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.89)
Information Technology > Artificial Intelligence > Robots (0.87)


Policy Optimizations: TRPO/PPO

#artificialintelligence | Sep-18-2021, 12:00:09 GMT

This post covers the policy optimization methods from the papers Trust Region Policy Optimization (Schulman et al. 2015) and Proximal Policy Optimization Algorithms (Schulman et al. 2017). I will briefly go over the Trust Region Policy Optimization method and two variants of Proximal Policy Optimization: one that adds an adaptive KL (Kullback-Leibler) penalty to the surrogate objective, and one that uses a clipped surrogate objective. In a traditional policy gradient method, we sample trajectories of states, actions, and rewards, then update the policy using the sampled trajectories. This works for basic control problems, but the algorithm tends to be unstable and does not solve an environment consistently. One cause is that each policy update changes the distribution of states the policy visits and the actions it takes, so the data collected under the old policy no longer matches the new one, which leads to instability.
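To make the adaptive KL penalty variant concrete, here is a minimal sketch of the penalized objective for a single sample and the coefficient update rule given in Schulman et al. 2017 (halve the coefficient when the observed KL falls below the target divided by 1.5, double it when it exceeds 1.5 times the target). The function names and the numbers in the usage lines are hypothetical.

```python
def kl_penalized_objective(ratio, advantage, kl, beta):
    """Surrogate objective for one sample, minus a KL penalty weighted by beta."""
    return ratio * advantage - beta * kl

def adapt_kl_coefficient(beta, observed_kl, target_kl):
    """Adjust the penalty weight after a policy update."""
    if observed_kl < target_kl / 1.5:
        return beta / 2.0   # policy moved too little: relax the penalty
    if observed_kl > target_kl * 1.5:
        return beta * 2.0   # policy moved too much: tighten the penalty
    return beta

# Hypothetical values for one update.
objective = kl_penalized_objective(ratio=1.1, advantage=2.0, kl=0.02, beta=1.0)
beta_next = adapt_kl_coefficient(beta=1.0, observed_kl=0.05, target_kl=0.01)
```

The penalty plays the same role as the trust region constraint: it discourages the new policy from drifting far from the old one whose samples were used to estimate the objective.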

new policy, objective, surrogate objective, (15 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)


